Abstract

Over twenty-five organizations participating in the First Census OCR Systems
Conference submitted confidence data as well as character classification data for the digit
test in that Conference. A three parameter function of the rejection rate r is fit to
the error rate versus rejection rate data derived from these submissions, and is found to fit
the data very well over the range from r = 0 to r = 0.15. The probability distribution
underlying the model e(r) curve is derived and shown to correspond to an inherently
inefficient  rejection  process.  With  only  a few  exceptions  that  seem  to  be  insignificant, 
all  of  the  organizations  submitting  data  to  the  Conference  for  scoring  seem  to  employ 
this  same  rejection  process  with  a remarkable  uniformity  of  efficiency  with  respect  to 
the  maximum  efficiency  allowed  for  this  process.  Two  measures  of  rejection  efficiency 
are  derived,  and  a practical  definition  of  ideal  OCR  performance  in  the  classification 
of  segmented  characters  is  proposed.  Perfect  rejection  is  shown  to  be  achievable,  but 
only  at  the  cost  of  reduced  classification  accuracy  in  most  practical  situations.  Human 
classification  of  a subset  of  the  digit  test  suggests  that  there  is  considerable  room  for 
improvement  in  machine  OCR  before  performance  at  the  level  of  the  proposed  ideal  is 
achieved. 


1 Introduction 


Over  40  different  OCR  systems  using  different  preprocessing,  feature  extraction,  and 
classification  algorithms  were  represented  in  the  First  Census  OCR  Systems  Conference. [1] 
The Conference provided three tests, one with 58,646 segmented digits, a second with
11,941 segmented upper case letters, and a third with 12,000 segmented lower case
letters. Over 115 test results representing different systems and tests were submitted to
NIST  for  scoring  as  part  of  the  Conference. 

Most of the test results submitted for scoring were accompanied by confidence files, and
most of the rest by rejection files. Rejection files contain integers from the set {0, 1},
one integer per test-character image. A 1 indicates that the hypothetical classification
should be scored as a reject rather than as correct or incorrect, and a 0 indicates that
the classification should be scored as correct if identical to the correct classification,
and incorrect otherwise. Each rejection file defines one point e(r) on the error rate e
versus rejection rate r curve, so many rejection files per hypothesis file are needed to
show the detailed shape of the curve.

Figure 1: Error rate versus rejection rate for all systems providing confidence data with their
classifications for the digit test.

Confidence  files  contain  fixed  point  numbers  on  the  range  from  0.0  to  1.0  inclusive, 
one  confidence  per  test  character  image.  The  ordering  of  the  confidence  data  indicates 
the  order  in  which  the  hypothetical  classifications  should  be  rejected  as  unclassifiable 
when  generating  error  rate  versus  rejection  rate  data  for  the  given  test  and  system. 

Only  one  confidence  file  per  hypothesis  file  is  needed  to  show  the  full  detail  of  the  e(r) 
curve. 
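The construction of an e(r) curve from one confidence file can be sketched as follows. This is only an illustration of the idea described above, not the Conference scoring software: the function name `error_vs_rejection`, the list-based inputs, and the handling of tied confidences are assumptions made for the example.

```python
def error_vs_rejection(confidences, correct):
    """Return a list of (r, e) points, one per nested rejection set.

    confidences[i] is the confidence for test image i (0.0 to 1.0) and
    correct[i] is True when the hypothesized class matches the reference.
    Classifications are rejected in order of increasing confidence.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    # Lowest confidence is rejected first; sorted() is stable, so ties
    # keep their original order (an arbitrary choice for this sketch).
    order = sorted(range(n), key=lambda i: confidences[i])
    errors_left = sum(1 for c in correct if not c)
    points = [(0.0, errors_left / n)]        # e(0): nothing rejected
    for k, i in enumerate(order[:-1], start=1):
        if not correct[i]:
            errors_left -= 1                 # an incorrect one was rejected
        accepted = n - k
        points.append((k / n, errors_left / accepted))
    return points


if __name__ == "__main__":
    # Hypothetical five-image test, for illustration only.
    conf = [0.99, 0.20, 0.95, 0.40, 0.90]
    ok = [True, False, True, True, True]
    for r, e in error_vs_rejection(conf, ok):
        print(f"r = {r:.2f}  e(r) = {e:.3f}")
```

A rejection file, by contrast, fixes a single rejection set and therefore yields only one of these (r, e) points.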

Figure  1 (2)  shows  all  of  the  error  rate  versus  rejection  rate  e(r)  data  calculated  over  the 
range  0 < r < 0.15  for  all  of  the  systems  that  submitted  confidence  (rejection)  files  to 
the  Conference.  Figure  1 suggests  at  least  two  questions:  1)  Is  there  any  significance  to 
the  fact  that  all  of  the  curves  in  that  figure  seem  to  have  similar  shapes  with  a strong 
negative correlation between e(0) and d ln e(0)/dr, and 2) how close does the lower
envelope  of  the  curves  in  that  figure  come  to  the  ideal  OCR  system  performance? 

To answer the first question, we derive the relation between the function e(r) and its
underlying probability distribution q(r). We then show that the e(r) data calculated
from the test results submitted with confidence files are well described over a significant
range of r by a simple three parameter equation, and that the probability distribution
q(r) associated with this equation represents an inherently inefficient rejection process
compared to the perfect rejection process.

We also show that we do not have techniques that allow us to answer the second
question. However, comparison with human classification of a subset of the digit test
suggests that there is considerable room for improvement in both e(0) and in e'(r)
beyond the lower envelope of the e(r) curves in Fig. 1.

Figure 2: Error rate versus rejection rate for all systems providing rejection data with their
classifications for the digit test.

The discussion in this paper is confined to the digit test of the First Census OCR
Systems Conference, but the results were similar for the upper and lower case letter
tests, with a single qualification: e(0) < 0.05, 0.10, or 0.20 for the digit, upper case,
and lower case tests, respectively, for roughly half of the results submitted for scoring.

The e(r) curves for all three tests, for all of the systems submitting results, are plotted
over the range 0 < r < 0.50 in the Conference report. [1]


2 Error  rate  versus  rejection  rate 


Let  A be  a subset  of  the  ASCII  character  set,  let  T be  a set  of  segmented  character 
images,  let  H be  a function  whose  domain  is  T and  whose  range  is  A,  and  let  R be  a 
set  of  subsets  of  T including  T,  such  that  for  each  non-empty  set  that  is  a member  of 
R there  is  one  and  only  one  set  in  R that  has  one  less  member. 

H is  a set  of  hypothetical  classifications  of  T,  and  R is  a complete  rejection  set  for  H. 
The  rejection  rate  r is  defined  for  each  subset  of  T in  R as  the  ratio  of  the  number 
of  members  of  that  subset  to  the  number  of  members  of  T.  The  classifications  in  H 
that  correspond  to  the  images  in  each  rejection  subset  of  R are  rejected  rather  than 
scored  correct  or  incorrect  to  generate  each  e(r)  point.  The  classifications  that  are  not 
rejected are said to be accepted.

For any given T, the range of the variable r is a discrete set, but for simplicity, we
treat it as a continuum in the following analysis. Let q(r) be the fraction of the
classifications rejected as the rejection rate is changed from r to r + dr that are
actually incorrect. Thus, q(r) is the probability as a function of rejection rate r that
a rejected classification is actually an incorrect classification. In this case, the error
rate e(r), which is defined as the ratio of accepted (unrejected) classifications that are
incorrect to the total number of accepted classifications, is given by

$$ e(r) = \frac{e(0) - f(r)}{1 - r}, \tag{1} $$

where

$$ f(r) = \int_0^r q(s)\,ds \tag{2} $$

is the fraction of all classifications that have been rejected at rejection rate r and are
actually incorrect, and is equal to r for perfect rejection. Equations 1 and 2 may be
combined to give the slope of the error rate,


$$ e'(r) = \frac{e(0) - f(r) - f'(r)(1 - r)}{(1 - r)^2} = \frac{e(r) - q(r)}{1 - r}. \tag{3} $$
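The intermediate steps leading to eq. 3 are short enough to write out. The following sketch uses only eqs. 1 and 2 together with the identity f'(r) = q(r), which follows from differentiating the integral in eq. 2:

```latex
\begin{align*}
e'(r) &= \frac{d}{dr}\left[\frac{e(0) - f(r)}{1 - r}\right]
       = \frac{-f'(r)(1 - r) + [e(0) - f(r)]}{(1 - r)^2} \\
      &= \frac{1}{1 - r}\left[\frac{e(0) - f(r)}{1 - r} - f'(r)\right]
       = \frac{e(r) - q(r)}{1 - r}.
\end{align*}
```

The last form shows directly that the error rate falls with increasing r only while q(r) exceeds e(r).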


If e'(r) is zero in eq. 3, then

$$ q(r) = e(r) = e_c, \tag{4} $$

where e_c is a constant. This means that the probability of rejecting an incorrect
classification is equal to the fraction of incorrect classifications remaining in the unrejected
sample. In this case, the rejection mechanism just rejects classifications at random.

If q(r) is equal to a constant q_c for r_1 < r < r_2, then


$$ e(r) = \frac{e(0) - f(r_1) - q_c (r - r_1)}{1 - r} \tag{5} $$

and

$$ e'(r) = \frac{e(0) - f(r_1) - q_c (1 - r_1)}{(1 - r)^2} \tag{6} $$

over the same subrange. Equation 6 can be written in terms of r_2 instead of r_1, but
due to the integral definition of f(r) in eq. 2 only one of the two end points of the
interval over which q(r) is constant is needed to express the derivative in this case.

A perfect rejection mechanism is characterized by

$$ q(r) = h(r), \tag{7} $$

where h(r) = 1 for 0 < r < e(0), and h(r) = 0 for r > e(0), in which case,

$$ e(r) = h(r)\,\frac{e(0) - r}{1 - r}. \tag{8} $$
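The two limiting cases of this section, random rejection (eq. 4) and perfect rejection (eqs. 7 and 8), can be checked numerically from eqs. 1 and 2. The sketch below is illustrative only: the function name `error_rate`, the midpoint-rule integration, and the value e(0) = 0.05 are assumptions of the example.

```python
def error_rate(r, e0, q, steps=10000):
    """Evaluate eq. 1, e(r) = (e(0) - f(r)) / (1 - r), with f(r) from eq. 2.

    q is a function of the rejection rate; the integral of q from 0 to r
    is approximated with the midpoint rule (an assumption of this sketch).
    """
    if not 0.0 <= r < 1.0:
        raise ValueError("r must satisfy 0 <= r < 1")
    h = r / steps
    f = sum(q((k + 0.5) * h) for k in range(steps)) * h
    return (e0 - f) / (1.0 - r)


if __name__ == "__main__":
    e0 = 0.05                                    # hypothetical e(0)
    random_q = lambda s: e0                      # eq. 4 with e_c = e(0)
    perfect_q = lambda s: 1.0 if s < e0 else 0.0 # eq. 7
    for r in (0.0, 0.02, 0.05, 0.10):
        print(f"r = {r:.2f}  random: {error_rate(r, e0, random_q):.4f}"
              f"  perfect: {error_rate(r, e0, perfect_q):.4f}")
```

With random rejection the error rate stays at e(0), whereas with perfect rejection it follows (e(0) - r)/(1 - r) and reaches zero at r = e(0).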


3 Ideal  OCR  system  performance 


Wilkinson  and  Geist  [2]  point  out  that  it  is  not  necessarily  possible  even  in  theory  for 
an  OCR  system  to  correctly  classify  every  test  image  in  a real  sample  of  segmented 
hand-printed  characters  without  errors  due  to  reader/writer  (WR)  ambiguity.  The 
best  performance  that  can  be  postulated  for  an  ideal  OCR  system  presented  with  WR 
ambiguous  characters  is  1)  that  it  classify  every  WR  unambiguous  character  image 
correctly  and  assign  it  a confidence  of  1.0,  and  2)  that  it  classify  every  WR  ambiguous 
image  as  the  most  probable  character  and  assign  it  a confidence  equal  to  the  WR 
probability  that  the  classification  is  correct  over  the  appropriate  set  of  writers  and 
readers.  This  requires  that  the  system  ambiguity  be  identical  to  the  WR  ambiguity  for 
each  image  in  the  test  set.  Conditions  1)  and  2)  constitute  a practical  definition  of  an 
ideal  OCR  system  with  respect  to  the  task  of  classifying  segmented  characters. 

It is important to distinguish between the system probability p_s(r) that a classification
is correct and the WR probability p_WR(r) that a classification is correct. The WR
probability is an a priori probability defined by a set of writers and a set of readers,
which establishes the upper bound for the system probability. On the other hand, the
system probability that a classification is correct, p_s(r), is an a posteriori probability
equal to 1 - q(r). For Conditions 1) and 2) in the preceding paragraph to hold, it is
necessary that p_s(r) = p_WR(r).

It  is  also  important  to  understand  that  an  ideal  OCR  system  as  defined  above  will 
not  produce  the  perfect  e(r)  curve  of  eqs.  7 and  8 unless  there  are  no  WR  ambiguous 
characters  in  the  set  of  test  images.  However,  it  is  possible  to  trade  off  performance 
with  respect  to  ideal  OCR  system  behavior  to  improve  rejection  performance.  In  the 
extreme  case,  one  can  purposely  reclassify  images  with  low  confidences  incorrectly 
using  a character  that  is  not  allowed.  This  assures  that  the  probability  of  rejecting  an 
incorrect  classification  is  unity,  and  therefore  produces  the  perfect  rejection  behavior 
of  eq.  8 while  simultaneously  increasing  the  error  rate  over  the  range  of  r where  this 
strategy  is  employed.  The  bottom  line  is  that  the  overall  system  performance  at  any 
value  of  r is  no  better,  and  is  probably  worse,  but  the  rejection  process  is  perfect.  On 
the  other  hand,  this  does  not  mean  that  a near  perfect  rejection  curve  is  necessarily  a 
symptom of non-ideal classification. Perfect rejection is possible with WR unambiguous
images. This discussion shows that the analysis of e(r) curves requires care to assure
that  false  conclusions  are  not  drawn. 
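The trade-off described above can be illustrated with a toy example. The ten classifications below, their confidences, the threshold, and the helper names are all invented for illustration; the point is only that forcing low-confidence images to an illegal character yields the perfect rejection curve of eq. 8 at the price of a higher e(0).

```python
def error_curve(confidences, correct):
    """(r, e) points obtained by rejecting in order of increasing confidence."""
    n = len(confidences)
    order = sorted(range(n), key=lambda i: confidences[i])
    wrong = sum(1 for c in correct if not c)
    points = [(0.0, wrong / n)]
    for k, i in enumerate(order[:-1], start=1):
        wrong -= 0 if correct[i] else 1
        points.append((k / n, wrong / (n - k)))
    return points


def force_illegal(confidences, correct, threshold):
    """Replace every classification whose confidence is below threshold by an
    illegal character, which is always scored as incorrect."""
    return [ok and conf >= threshold for conf, ok in zip(confidences, correct)]


# Hypothetical data: three low-confidence images, one of which the
# original system in fact classifies correctly.
conf = [0.10, 0.20, 0.30, 0.80, 0.85, 0.90, 0.92, 0.95, 0.97, 0.99]
ok = [False, True, False, True, True, True, True, True, True, True]

original = error_curve(conf, ok)
modified = error_curve(conf, force_illegal(conf, ok, threshold=0.5))

for (r, e_orig), (_, e_mod) in zip(original[:4], modified[:4]):
    print(f"r = {r:.1f}  original e = {e_orig:.3f}  modified e = {e_mod:.3f}")
```

In this example the modified system is never better than the original at any r, although its rejection process is perfect.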

Finally,  the  fact  that  Conditions  1)  and  2)  are  given  in  terms  of  probabilities  means 
that  an  OCR  system  satisfying  them  is  ideal  only  in  a statistical  sense.  It  is  possible 
for  a non-ideal  OCR  system  to  out-perform  the  ideal  system  on  any  given  test,  but, 
by  definition,  this  cannot  happen  for  the  ensemble  average  of  tests  over  which  the  WR 
probabilities  are  defined. 


4 Form  of  Conference  e(r)  data 


To  answer  the  first  question  posed  in  Section  1 about  Fig.  1,  we  attempted  to  fit  all  of 
the  data  in  that  figure  to  a simple  model.  A visual  examination  of  the  curves  in  that 
figure  suggests  that  they  might  be  well  described  by 


$$ e(r) = \frac{(e_0 - e_{min}) \exp(-r/r_0) + e_{min}}{1 - r}. \tag{9} $$

To test this conjecture, we fit the natural logarithms of the measured e(r) data to the
natural logarithm of eq. 9 over the range 0 < r < 0.15, where e_0 > 0, e_min > 0, and
r_0 > 0 were adjusted in the fit. Natural logarithms were used to minimize the variance
of the relative differences between the model and calculated e(r) values rather than
the variance of the absolute differences.

The results of the fits are summarized in Table 1, which lists the values of e_0, e_min,
and r_0 for each curve in Fig. 1. This table also lists the residual standard deviation σ
of each fit, and two ratios R_1 and R_2 that will be described later.

Eight data points were used in each fit. Three parameters were estimated. This leaves
five degrees of freedom in each fit. Because the fits were carried out on the natural
logarithms of the data, the residual standard deviations of the fits are actually the
standard deviations of the relative differences between the measured error rates and
those predicted by eq. 9. Thus a residual standard deviation of 0.01 corresponds to a
standard deviation of the relative errors of the fit of 1% over the range of the fit.

Equation 9 fits the data of Fig. 1 very well, as should be expected from visual inspection
of that figure; only two residual standard deviations are greater than 3%, and two thirds
are less than 2%. In fact, most of the e(r) curves for all of the tests and all of the
systems are well described by eq. 9 over a subrange 0 < r < r_s1, and by

$$ e(r) = e_s \tag{10} $$

over a subrange r_s1 < r < r_s2, where e_s, r_s1, and r_s2 are system dependent constants,
r_s1 < r_s2, and r_s2 0.15. The results of fits of eq. 9 to the e(r) data obtained for the
upper case and lower case letter tests, which can be found in Ref. [1], are very similar
to those shown in Table 1. However, the single ratio shown in that reference is a less
useful efficiency measure than the two ratios R_1 and R_2 that are discussed in the next
section.

SYSTEM     σ        e_0      e_min    r_0      R_1      R_2
AEG        0.0284   0.0347   0.0011   0.0525   0.6279   0.4889
ASOL       0.0319   0.0922   0.0000   0.2032   0.3971   0.2309
ATT_1      0.0293   0.0326   0.0018   0.0509   0.5915   0.3838
ATT_2      0.0159   0.0363   0.0013   0.0533   0.6942   0.5558
ATT_3      0.0721   0.0505   0.0077   0.0481   0.8828   0.5745
ATT_4      0.0199   0.0417   0.0012   0.0607   0.6540   0.5005
ERIM_1     0.0207   0.0391   0.0002   0.0597   0.6373   0.5336
ERIM_2     0.0151   0.0395   0.0009   0.0635   0.6208   0.4810
GTESS_1    0.0126   0.0667   0.0000   0.1044   0.6127   0.5082
GTESS_2    0.0068   0.0677   0.0030   0.1027   0.6030   0.5358
HUGHES_1   0.0288   0.0501   0.0000   0.0846   0.5697   0.4142
HUGHES_2   0.0298   0.0497   0.0000   0.0901   0.5274   0.4607
IBM        0.0144   0.0349   0.0016   0.0523   0.6233   0.5213
IFAX       0.0032   0.1703   0.0196   0.2062   0.6763   0.6724
KODAK_1    0.0415   0.0490   0.0008   0.0764   0.6109   0.4184
KODAK_2    0.0191   0.0413   0.0006   0.0708   0.5743   0.4117
NESTOR     0.0165   0.0452   0.0022   0.0650   0.7177   0.5268
NIST_2     0.0041   0.0918   0.0000   0.1469   0.5872   0.5552
NIST_3     0.0053   0.0973   0.0000   0.1386   0.6698   0.6357
NIST_4     0.0117   0.0501   0.0014   0.0782   0.6403   0.4941
NYNEX      0.0244   0.0441   0.0022   0.0674   0.6717   0.4708
OCRSYS     0.0042   0.0155   0.0134   0.0348   0.0474   0.0370
THINK_1    0.0093   0.0493   0.0017   0.0720   0.6928   0.5143
THINK_2    0.0195   0.0382   0.0022   0.0539   0.7413   0.5913
UPENN      0.0039   0.0905   0.0004   0.1484   0.5682   0.5436
VALEN_1    0.0078   0.1811   0.0000   0.2525   0.6534   0.5235
VALEN_2    0.0130   0.1595   0.0000   0.2228   0.6604   0.5166

Table 1: Parameters of fit of eq. 9 to data in Fig. 1 for 0 < r < 0.15.
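The fitting procedure described in this section, a three parameter least-squares fit of ln e(r) to the logarithm of eq. 9, can be sketched as follows. The paper does not say which algorithm or which eight rejection rates were used, so the Levenberg-Marquardt iteration, the starting guess, the crude bound handling, and the evenly spaced synthetic test points below are all assumptions made for illustration.

```python
import math


def model_log_e(r, e0, emin, r0):
    """Natural logarithm of eq. 9."""
    return math.log((e0 - emin) * math.exp(-r / r0) + emin) - math.log(1.0 - r)


def _solve3(a, b):
    """Solve the 3x3 system a x = b by Gaussian elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda i: abs(m[i][col]))
        m[col], m[piv] = m[piv], m[col]
        for i in range(col + 1, 3):
            factor = m[i][col] / m[col][col]
            for j in range(col, 4):
                m[i][j] -= factor * m[col][j]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = m[i][3] - sum(m[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = s / m[i][i]
    return x


def _cost(rs, ys, p):
    """Sum of squared differences between ln(data) and ln(model)."""
    return sum((y - model_log_e(r, *p)) ** 2 for r, y in zip(rs, ys))


def fit_eq9(rs, es, iterations=200):
    """Fit ln e(r) to ln(eq. 9); returns (e0, emin, r0, sigma)."""
    ys = [math.log(e) for e in es]
    tail = min(e * (1.0 - r) for r, e in zip(rs, es))
    p = [max(es), 0.5 * tail, max(es)]       # crude starting guess
    lam = 1e-3                               # Marquardt damping factor
    cost = _cost(rs, ys, p)
    for _ in range(iterations):
        jtj = [[0.0] * 3 for _ in range(3)]
        jtr = [0.0] * 3
        e0, emin, r0 = p
        for r, y in zip(rs, ys):
            decay = math.exp(-r / r0)
            num = (e0 - emin) * decay + emin
            # Partial derivatives of ln(eq. 9) w.r.t. e0, emin, r0.
            grad = [decay / num, (1.0 - decay) / num,
                    (e0 - emin) * decay * r / (r0 * r0 * num)]
            res = y - model_log_e(r, e0, emin, r0)
            for i in range(3):
                jtr[i] += grad[i] * res
                for j in range(3):
                    jtj[i][j] += grad[i] * grad[j]
        damped = [[jtj[i][j] + (lam * jtj[i][i] if i == j else 0.0)
                   for j in range(3)] for i in range(3)]
        step = _solve3(damped, jtr)
        trial = [pi + si for pi, si in zip(p, step)]
        valid = trial[0] > trial[1] >= 0.0 and trial[2] > 1e-3
        if valid and _cost(rs, ys, trial) < cost:
            p, cost, lam = trial, _cost(rs, ys, trial), lam / 10.0
        else:
            lam *= 10.0
    sigma = math.sqrt(cost / (len(rs) - 3))  # n - 3 degrees of freedom
    return p[0], p[1], p[2], sigma


if __name__ == "__main__":
    # Synthetic, noise-free data at eight hypothetical rejection rates.
    true = (0.04, 0.001, 0.06)
    rs = [0.02 * k for k in range(8)]
    es = [math.exp(model_log_e(r, *true)) for r in rs]
    print(fit_eq9(rs, es))
```

Because the residuals are formed from logarithms, the returned sigma is directly comparable to the σ column of Table 1, that is, it approximates the standard deviation of the relative differences between data and model.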

If e(r) satisfies eq. 9, as do the e(r) data shown in Fig. 1, then

$$ e'(0) = -\frac{e_0 (1 - r_0) - e_{min}}{r_0}. \tag{11} $$

Thus, if e(r) satisfies eq. 9, then

$$ \frac{d \ln e(0)}{dr} = \frac{e'(0)}{e(0)} = -\frac{1 - r_0 - e_{min}/e_0}{r_0}. \tag{12} $$

Since r_0 ~ e(0) ≪ 1 and e_min < e_0 for the systems in Table 1, d ln e(0)/dr becomes
more negative as e(0) decreases. This produces the strong negative correlation between
e(0) and d ln e(0)/dr in Fig. 1.
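The trend can be seen by evaluating eq. 12 for a few rows of Table 1. The sketch below is illustrative; the function names are assumptions, and the three rows were chosen only because they span a range of e_0 values.

```python
import math


def e_model(r, e0, emin, r0):
    """Eq. 9."""
    return ((e0 - emin) * math.exp(-r / r0) + emin) / (1.0 - r)


def dlne_dr_at_zero(e0, emin, r0):
    """Eq. 12: logarithmic slope of e(r) at r = 0."""
    return -(1.0 - r0 - emin / e0) / r0


# Fitted parameters (e0, emin, r0) copied from three rows of Table 1.
rows = {
    "AEG": (0.0347, 0.0011, 0.0525),
    "NIST_4": (0.0501, 0.0014, 0.0782),
    "UPENN": (0.0905, 0.0004, 0.1484),
}

if __name__ == "__main__":
    for name, (e0, emin, r0) in rows.items():
        print(f"{name:8s} e0 = {e0:.4f}  d ln e(0)/dr = "
              f"{dlne_dr_at_zero(e0, emin, r0):.2f}")
```

For these rows the logarithmic slope is most negative for the system with the smallest e_0, in line with the correlation noted above.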


5 Significance  of  shape  of  e(r)  function 


Equation  10  corresponds  to  the  case  where  the  rejection  process  has  degenerated  to  a 
random  sampling  of  the  unrejected  classifications,  as  described  in  connection  with  eq. 
4.  On  the  other  hand,  according  to  eq.  3,  eq.  9 corresponds  to  the  case  where  the 
probability  of  rejecting  a classification  that  is  actually  incorrect  is  given  by 

$$ q(r) = \frac{e_0 - e_{min}}{r_0} \exp(-r/r_0), \tag{13} $$

which can be rewritten in terms of e(r) as

$$ q(r) = \frac{e(r)(1 - r) - e_{min}}{r_0}, \tag{14} $$

and which is bounded above by

$$ q(r) = e(r)/r_0. \tag{15} $$

The probability distribution of eq. 15 is an improvement by a factor of 1/r_0 over
the probability distribution for a completely random rejection process given in eq. 4,
but it is still greatly inferior to the distribution for a perfect process. In fact, no
probability distribution that is proportional to e(r) can be efficient, because the very
act of reducing e(r) through the rejection process reduces the efficiency with which
incorrect classifications are rejected.
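The step from eq. 9 to eqs. 13 through 15 can be made explicit. The following is a sketch that uses eq. 1 solved for f(r), the identity q(r) = f'(r) from eq. 2, and e(0) = e_0 as implied by eq. 9; the final inequality assumes 0 ≤ r < 1 and e_min ≥ 0.

```latex
\begin{align*}
f(r) &= e(0) - (1 - r)\,e(r)
      = (e_0 - e_{min})\left[1 - \exp(-r/r_0)\right], \\
q(r) &= f'(r) = \frac{e_0 - e_{min}}{r_0}\,\exp(-r/r_0)
      = \frac{(1 - r)\,e(r) - e_{min}}{r_0}
      \le \frac{e(r)}{r_0}.
\end{align*}
```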

The two ratios R_1 and R_2 in Table 1 address the efficiency of the rejection process.
When e(r) satisfies eq. 9, e'(0) is given by eq. 11 and is bounded below by e(0) - 1
according to eq. 6. Thus

$$ R_1 = \frac{e'(0)}{e(0) - 1} = \frac{e_0 (1 - r_0) - e_{min}}{r_0 [1 - e(0)]} \tag{16} $$

in Table 1 is a measure of the efficiency of a rejection process over the range of r (if
any) for which it satisfies eq. 9. On the other hand, eq. 9 describes a very inefficient
rejection process, so

$$ R_2 = \frac{[e(0) - e(r_2)](1 - r_2)}{r_2 [1 - e(0)]}, \tag{17} $$

where r_2 has a small value, is a measure of how efficient the early part of the rejection
process is compared to the perfect process described by eq. 8. For Table 1, r_2 =
0.02. Since R_1 and R_2 measure efficiency over different ranges of r, they are not well
correlated in Table 1.
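Both ratios are simple to evaluate. The sketch below is illustrative: the function names are assumptions, the first call uses the AEG parameters from Table 1, and the remaining calls use a hypothetical e(0) = 0.05 to show the two limiting cases of the second ratio.

```python
def r1_efficiency(e0, emin, r0):
    """Eq. 16, taking e(0) = e0 as implied by eq. 9."""
    return (e0 * (1.0 - r0) - emin) / (r0 * (1.0 - e0))


def r2_efficiency(e_at_0, e_at_r2, r2=0.02):
    """Eq. 17; r2 = 0.02 is the value used for Table 1."""
    return (e_at_0 - e_at_r2) * (1.0 - r2) / (r2 * (1.0 - e_at_0))


if __name__ == "__main__":
    # AEG row of Table 1: e0 = 0.0347, emin = 0.0011, r0 = 0.0525.
    print("R1 (AEG):", round(r1_efficiency(0.0347, 0.0011, 0.0525), 3))

    e0 = 0.05                                   # hypothetical e(0)
    perfect = (e0 - 0.02) / (1.0 - 0.02)        # eq. 8 at r = 0.02
    print("R2, perfect rejection:", r2_efficiency(e0, perfect))
    print("R2, random rejection:", r2_efficiency(e0, e0))
```

The second ratio is 1 for the perfect process of eq. 8 and 0 for random rejection, which is what makes it usable as an efficiency measure for small r.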

The  question  that  this  section  addressed  was  whether  or  not  it  is  significant  that  all 
of  the  e(r)  curves  in  Fig.  1 appear  to  have  the  same  shape.  The  answer  is  yes.  All  of 
the systems producing the e(r) data in that figure seem to employ an inherently
inefficient rejection process for which the probability of rejecting an incorrect classification
decreases in proportion to the fraction of incorrect classifications remaining in the
unrejected set of classifications. For all but three of these systems the proportionality
constant  ranges  from  53%  to  74%  of  the  maximum  value  consistent  with  this  type  of 
rejection  process.  Two  of  the  three  are  significantly  less  efficient,  and  the  third  is  a 
little  more  efficient  (88%),  but  has  a relatively  large  (7%)  residual  standard  deviation 
of  the  fit. 

The  use  by  23  of  26  systems  of  what  is  essentially  the  same  rejection  process  with  a 
factor  of  1.4  variation  in  its  efficiency  constitutes  surprising  uniformity  in  light  of  the 
fact  that  e(0)  ranges  over  a factor  of  more  than  5.5  for  the  same  systems,  and  the  fact 
that  these  systems  employ  diverse  preprocessing,  feature  extraction,  and  classification 
algorithms. 

Both  Figs.  1 and  2 have  one  curve  that  becomes  flat  for  very  small  r.  Both  curves  were 
obtained  from  the  same  system  because  both  rejection  files  and  confidence  files  were 
submitted  with  the  hypotheses  files  for  this  system.  This  system  had  a significantly 
better value for e(0) and a significantly worse value for d ln e(0)/dr than any other
system.  There  is  also  a system  in  Fig.  2 whose  e(r)  curve  is  defined  by  only  two 
points,  but  which  employs  a rejection  process  that  is  significantly  more  efficient  than 
any  of  the  others  shown  in  Figs.  1 and  2.  However,  90%  of  the  classifications  that 
were  rejected  by  this  system  to  generate  its  second  point  (e(0.03)  = 0.0186)  in  Fig. 
2 had  been  classified  incorrectly  on  purpose  by  submitting  an  illegal  character  as  the 
hypothetical  classification.  So  e(0)  was  artificially  increased  to  improve  rejection.  The 
rest  of  the  e(r)  curves  in  Fig.  2 are  not  significantly  different  than  those  in  Fig.  1. 
Thus,  34  out  of  38  OCR  systems  show  remarkable  uniformity  in  the  nature  of  their 
rejection  process,  and  there  does  not  appear  to  be  anything  significant  from  the  point 
of  view  of  rejection  theory  about  the  4 outliers. 

Thus the finding that the shape of the e(r) curve signifies a very inefficient rejection
process, combined with the surprising uniformity among the e(r) curves, leads to a new
question. Is the shape of the e(r) curve determined in some fundamental way by the
data? For instance, is it possible that the WR unambiguous images are distributed in
image space in such a way that inadequacies in preprocessing, feature extraction, and
classification generate system ambiguities whose rejection probabilities are given by
eq. 13? If so, rejection efficiency will be improved by the same measures that improve
forced decision accuracy. If not, special measures would apparently be required to
substantially improve rejection efficiency.


6 Comparison  with  human  performance 


It  is  not  clear  that  we  have  the  means  to  determine  the  ideal  e(r)  curve  for  any  given 
test.  Nevertheless,  results  of  human  classification  are  certainly  a good  start.  One  of 
the  authors  (JG)  classified  the  first  10,000  images  in  the  digit  test  under  the  same 
test  conditions  as  the  OCR  systems  represented  in  the  Conference.  The  results  were 
e(0)  = 0.0157  and  e(0.0122)  = 0.0035. 

The  human  value  for  e(0)  is  very  close  to  the  lowest  value,  e(0)  = 0.0156,  obtained  by 
any  of  the  systems  represented  in  the  Conference,  but  this  is  misleading.  All  images 
that  were  perceived  by  the  human  classifier  to  be  ambiguous  were  classified  as  question 
marks,  which  artificially  increased  e(0)  while  producing  perfect  rejection  for  0 < r < 
0.0122.  Even  a non-optimum  strategy  like  random  guessing  would  have  reduced  e(0)  by 
0.1  x 0.0122  = 0.0012.  Furthermore,  many  of  the  ambiguities  existed  between  only  two 
digits,  so  confining  the  guessing  to  the  two  most  likely  possibilities  might  have  reduced 
e(0)  by  as  much  as  0.5  x 0.0122  = 0.0061.  Thus  the  human  might  have  been  able  to 
obtain  0.0096  < e(0)  < 0.0145,  while  leaving  e(0.0122)  unchanged.  If  the  human  were 
able  to  choose  the  more  (or  most)  likely  of  the  classifications  when  ambiguities  existed, 
then  even  lower  values  for  e(0)  would  be  possible. 
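The bounds quoted above follow from simple arithmetic, sketched here for convenience. The variable names are assumptions; the numbers are those given in the text.

```python
e0_human = 0.0157      # measured e(0) with ambiguous images typed as '?'
r_ambiguous = 0.0122   # fraction of images rejected as ambiguous

# Expected reduction in e(0) from guessing among all ten digits
# (1 chance in 10) and among the two most likely digits (1 chance in 2).
upper = e0_human - 0.1 * r_ambiguous
lower = e0_human - 0.5 * r_ambiguous

print(f"{lower:.4f} < e(0) < {upper:.4f}")
```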

Moreover,  the  fact  that  the  human  value  e(0.0122)  = 0.0035  is  well  over  a factor  of  four 
lower  than  the  lowest  value  of  e(0.0122)  in  Figs.  1 and  2 strongly  suggests  that  the  lower 
envelope  of  the  curves  in  those  figures  is  still  far  from  the  performance  of  an  ideal  OCR 
system.  The  only  caveats  are  that  the  human  performance  was  obtained  for  a single 
human  on  a single  test  that  is  a subset  of  the  test  used  for  the  OCR  systems.  Grother 
[3]  has  shown  that  it  is  unlikely  that  the  human  result  would  be  significantly  different 
for  the  complete  digit  test.  It  is  also  unlikely  that  the  factor  of  four  superiority  of  the 
human  result  is  a statistical  fluke  that  would  change  significantly  over  an  ensemble  of 
tests  involving  more  writers  and  more  human  classifiers. 

There is a fundamental problem with using a single human in an attempt to determine
the ideal e(r) curve for a set of real-world character images such as those used in the
Conference. Humans are not comfortable with, and may not even be capable of, generating
confidences for their classifications. Humans with sufficient incentive are quite happy
rejecting ambiguous character images while classifying those that they find unambiguous,
but they are not so comfortable assigning a single classification to an ambiguous
image, much less a confidence. Even the plurality vote of a large number of human
classifiers will suffer from this problem unless it happens that different humans usually
find different character images ambiguous.

Our experience suggests that it might be possible to get humans to generate the data
needed to calculate an e(r) curve in a multipass process. On the first pass each human
would hit the appropriate keyboard key to classify the subjectively unambiguous
characters, and reject the rest by typing a question mark. The second pass would present
only the rejected characters for classification. On this pass each human would hit two
different keys to assign two different classes to any images that were subjectively
ambiguous between only two characters, and so forth. We can even imagine letting each
human classifier hit the key corresponding to each character of an ambiguous character
set a number of times proportional to his or her subjective estimate of the relative
plausibility of the classification. Still, it is not clear that humans would be comfortable
with this task when more than two-character classifications were attempted.
Nevertheless, pooling the results of a number of human multipass classifications might give
a good estimate of the ideal e(r) curve for a given set of test images, at least over a useful
subrange of r.


7 Conclusion 


We  have  derived  the  relation  between  e(r)  and  its  underlying  probability  distribution 
q(r).  We  also  showed  that  the  e(r)  data  submitted  for  the  digit  test  of  the  First 
Census  OCR  Systems  Conference  are  well  described  for  0 < r < 0.15  by  eq.  9,  and 
that  the  corresponding  probability  distribution  q(r),  given  in  eq.  13,  describes  an 
inherently  inefficient  rejection  process  compared  to  the  perfect  rejection  process.  We 
have  introduced  some  measures  of  the  efficiency  of  the  rejection  process  for  isolated 
character  OCR,  and  have  proposed  a definition  of  ideal  performance  in  the  latter 
task.  The  definition  is  statistical  in  nature,  but  it  is  general  enough  to  allow  ideal 
performance  to  be  better  than  human  performance,  since  we  have  no  reason  to  expect 
human  performance  to  be  ideal.  We  have  also  discussed  the  difficulties  of  determining 
ideal  performance  on  any  given  test,  and  have  compared  the  digit  test  results  to  human 
classification  of  a subset  of  that  test.  The  results  suggest  that  there  is  considerable 
room  for  improvement  in  machine  OCR  before  it  can  challenge  human  performance 
for  accuracy.  Of  course,  that  does  not  mean  that  it  cannot  already  challenge  human 
performance in applications where accuracy must be balanced with cost.